For this project I will be predicting the nightly price of Airbnb listings in Barcelona based on the features of each listing.
When traveling, it can be difficult to find the lowest price for your vacation rental. Whether you’re a student trying to save every penny or simply want to make the most economical choices to maximize your trip, it helps to know what features to look for when searching for an Airbnb. I am interested in how neighborhood, property type, number of reviews, size of the property, and other features affect the nightly price of a listing. The goal is to use these features to our advantage to find the lowest price possible. In this project I will investigate the effect of these variables on nightly price.
My data comes from Kaggle, a data set called ‘Barcelona Airbnb listings - Inside Airbnb’. Originally the data was extracted from the Airbnb website by ‘Inside Airbnb’. Let’s load all of our packages and the data set we are working with!
library(tidymodels)
library(tidyverse)
library(visdat)
library(dplyr)
library(corrplot)
library(corrr)
library(ranger)
library(beepr)
library(xgboost)
library(forcats)
library(vip)
tidymodels_prefer()
airbnb_raw <- read_csv('Data/barcelona_listings.csv')
Let’s take a look at what our raw data looks like!
dim(airbnb_raw)
## [1] 19833 62
It looks like we have 19,833 observations and 62 variables.
Next I’ll select the variables I want to keep in my model and filter
the data set down to only those variables. The
airbnb_raw data frame has 62 variables, many of which
contain redundant or unhelpful information. The types of variables I
dropped are long descriptions of the property, descriptions of the host,
URLs, etc. Here I select the variables I want to keep to simplify
my data set.
columns <- c("host_is_superhost", "host_listings_count",
  "latitude", "longitude", "property_type", "review_scores_accuracy",
  "room_type", "accommodates", "bathrooms", "bedrooms", "beds", "square_feet",
  "price", "cleaning_fee", "minimum_nights", "availability_30",
  "availability_60", "availability_90", "availability_365", "number_of_reviews_ltm",
  "review_scores_rating", "neighbourhood_group_cleansed", "number_of_reviews",
  "review_scores_cleanliness")
airbnb <- airbnb_raw %>%
  select(all_of(columns))
Now let’s check out the data we’ll be using for the rest of the project!
head(airbnb)
## # A tibble: 6 × 24
## host_is_superhost host_listings_count latitude longitude property_type
## <lgl> <dbl> <dbl> <dbl> <chr>
## 1 FALSE 45 41.4 2.19 Apartment
## 2 FALSE 45 41.4 2.17 Apartment
## 3 FALSE 2 41.4 2.20 Apartment
## 4 TRUE 5 41.4 2.22 Apartment
## 5 TRUE 1 41.4 2.16 Apartment
## 6 FALSE 9 41.4 2.17 Apartment
## # ℹ 19 more variables: review_scores_accuracy <dbl>, room_type <chr>,
## # accommodates <dbl>, bathrooms <dbl>, bedrooms <dbl>, beds <dbl>,
## # square_feet <dbl>, price <chr>, cleaning_fee <chr>, minimum_nights <dbl>,
## # availability_30 <dbl>, availability_60 <dbl>, availability_90 <dbl>,
## # availability_365 <dbl>, number_of_reviews_ltm <dbl>,
## # review_scores_rating <dbl>, neighbourhood_group_cleansed <chr>,
## # number_of_reviews <dbl>, review_scores_cleanliness <dbl>
price and cleaning_fee are dollar amounts,
but they are stored as character strings with the ‘$’ and comma
separators included. Let’s strip those characters and convert the
columns to numeric. I’ll also rename the lengthy
neighbourhood_group_cleansed to the simpler neighborhood.
airbnb$price <- as.numeric(gsub("[\\$,]", "", airbnb$price))
airbnb$cleaning_fee <- as.numeric(gsub("[\\$,]", "", airbnb$cleaning_fee))
airbnb <- airbnb %>% rename(neighborhood = neighbourhood_group_cleansed)
host_is_superhost: Indicates whether the host has superhost status.
host_listings_count: The number of listings the host has on Airbnb.
neighborhood: The district of Barcelona where the listing is located (renamed from neighbourhood_group_cleansed).
latitude and longitude: Geographic coordinates of the listing.
property_type: The type of property (e.g., apartment, house, villa).
room_type: The type of room (e.g., entire home, private room, shared room).
accommodates: The maximum number of guests the listing can accommodate.
bathrooms, bedrooms, and beds: The number of bathrooms, bedrooms, and beds in the listing.
square_feet: The square footage of the listing.
price: The nightly price of the listing.
cleaning_fee: Any additional cleaning fee charged by the host.
minimum_nights: The minimum number of nights guests can book.
availability_X: Columns indicating availability over the next 30, 60, 90, and 365 days.
number_of_reviews: The total number of reviews the listing has received.
number_of_reviews_ltm: The number of reviews received in the last 12 months.
review_scores_rating, review_scores_accuracy, and review_scores_cleanliness: Average review scores for the listing overall, its accuracy, and its cleanliness.
Here let’s look at a summary of our newly cleaned data set. We have
one logical variable, three character variables, and 20 numeric
variables. Our outcome, price, ranges from 7 to 9120
euros with a mean of about 130 euros. We can already see here that a few of
our variables are missing many of their observations.
summary(airbnb)
## host_is_superhost host_listings_count latitude longitude
## Mode :logical Min. : 0.00 Min. :41.35 Min. :2.088
## FALSE:16097 1st Qu.: 1.00 1st Qu.:41.38 1st Qu.:2.157
## TRUE :3702 Median : 3.00 Median :41.39 Median :2.168
## NA's :34 Mean : 20.36 Mean :41.39 Mean :2.168
## 3rd Qu.: 12.00 3rd Qu.:41.40 3rd Qu.:2.178
## Max. :1411.00 Max. :41.46 Max. :2.229
## NA's :34
## property_type review_scores_accuracy room_type accommodates
## Length:19833 Min. : 2.000 Length:19833 Min. : 1.000
## Class :character 1st Qu.: 9.000 Class :character 1st Qu.: 2.000
## Mode :character Median :10.000 Mode :character Median : 2.000
## Mean : 9.401 Mean : 3.348
## 3rd Qu.:10.000 3rd Qu.: 4.000
## Max. :10.000 Max. :18.000
## NA's :4855
## bathrooms bedrooms beds square_feet
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.0
## 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 0.0
## Median : 1.000 Median : 1.000 Median : 2.000 Median : 161.0
## Mean : 1.318 Mean : 1.568 Mean : 2.258 Mean : 461.0
## 3rd Qu.: 1.500 3rd Qu.: 2.000 3rd Qu.: 3.000 3rd Qu.: 791.2
## Max. :16.000 Max. :16.000 Max. :40.000 Max. :3444.0
## NA's :9 NA's :3 NA's :34 NA's :19309
## price cleaning_fee minimum_nights availability_30
## Min. : 7.0 Min. : 0.00 Min. : 1.000 Min. : 0.000
## 1st Qu.: 40.0 1st Qu.: 10.00 1st Qu.: 1.000 1st Qu.: 0.000
## Median : 65.0 Median : 30.00 Median : 2.000 Median : 5.000
## Mean : 129.6 Mean : 43.22 Mean : 8.521 Mean : 8.245
## 3rd Qu.: 113.0 3rd Qu.: 60.00 3rd Qu.: 4.000 3rd Qu.:13.000
## Max. :9120.0 Max. :500.00 Max. :900.000 Max. :30.000
## NA's :4069
## availability_60 availability_90 availability_365 number_of_reviews_ltm
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 4.00 1st Qu.: 9.00 1st Qu.: 49.0 1st Qu.: 0.0
## Median :17.00 Median :35.00 Median :160.0 Median : 4.0
## Mean :21.73 Mean :37.14 Mean :170.5 Mean : 12.4
## 3rd Qu.:37.00 3rd Qu.:61.00 3rd Qu.:296.0 3rd Qu.: 19.0
## Max. :60.00 Max. :90.00 Max. :365.0 Max. :172.0
##
## review_scores_rating number_of_reviews review_scores_cleanliness
## Min. : 20.0 Min. : 0.00 Min. : 2.000
## 1st Qu.: 88.0 1st Qu.: 1.00 1st Qu.: 9.000
## Median : 93.0 Median : 7.00 Median :10.000
## Mean : 91.1 Mean : 32.82 Mean : 9.257
## 3rd Qu.: 97.0 3rd Qu.: 38.00 3rd Qu.:10.000
## Max. :100.0 Max. :645.00 Max. :10.000
## NA's :4847 NA's :4855
## neighborhood
## Length:19833
## Class :character
## Mode :character
##
##
##
##
Let’s see how many missing values we’re working with.
vis_miss(airbnb)
From here, we can see that review_scores_accuracy,
square_feet, cleaning_fee,
review_scores_rating, and
review_scores_cleanliness each have over 20% of their data
missing. I will drop these predictors because the proportions of missing
data are too high and they are not the most important variables for my
model.
airbnb <- airbnb %>%
  select(-c(review_scores_accuracy, square_feet, cleaning_fee,
            review_scores_rating, review_scores_cleanliness))
vis_miss(airbnb)
Now less than 0.1% of the data is missing, only 114 values, so it is safe to drop the rows that contain them.
sum(is.na(airbnb))
## [1] 114
airbnb <- na.omit(airbnb)
airbnb$neighborhood <- factor(airbnb$neighborhood)
airbnb$property_type <- factor(airbnb$property_type)
airbnb$room_type <- factor(airbnb$room_type)
Now that our data is clean and tidy, let’s take a look at some of the relationships between variables and our outcome.
First, we’ll look at the distribution of price.
ggplot(airbnb, aes(x= price)) + geom_histogram(bins = 250,fill= 'mediumorchid') +
theme_minimal() + labs(title = 'Distribution of Nightly Price')
The distribution of our outcome variable, price, is
heavily right-skewed, with most observations falling between about 0 and
150 euros per night. There were 260 observations above 1000 euros that
stretched the range all the way up to 9120 euros. Since such a small
share of observations (about 1.3%) falls above 1000, it makes sense to
drop them for a less skewed, more representative distribution.
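That 1.3% figure comes from treating the comparison price >= 1000 as a logical vector whose mean is the proportion of TRUEs. A minimal sketch of the idea (the toy prices below are made up; on our data, the same call on airbnb$price gives roughly 0.013, about 260 of ~19,700 listings):

```r
# Share of values at or above a cutoff: the mean of a logical
# vector is the proportion of TRUEs.
share_above <- function(prices, cutoff) mean(prices >= cutoff)

# Toy illustration with made-up prices:
share_above(c(50, 120, 800, 1500, 2000), 1000)  # 0.4
```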
airbnb <- airbnb %>% filter(price < 1000)
ggplot(airbnb, aes(x= price)) + geom_histogram(bins = 250,fill= 'mediumorchid') +
theme_minimal() + labs(title = 'Distribution of Nightly Price')
Next, we’ll look at a correlation plot of our numeric variables to see if there are any notable correlations between predictors or the response.
airbnb %>%
select_if((is.numeric)) %>%
cor() %>%
corrplot(method = 'color', tl.srt = 45)
There are high positive correlations between the availability
variables, between beds and bedrooms and
accommodates, and between number of reviews and number of reviews
in the last 12 months. An Airbnb with more bedrooms tends to have
more beds, and can therefore accommodate more guests. With the response
variable, price, there are positive correlations with
accommodates, bathrooms,
bedrooms, and beds, which suggests possible
positive linear relationships with these variables. As expected, bigger
Airbnbs with more rooms and beds are going to be pricier per night. The
variables number_of_reviews and
number_of_reviews_ltm contain similar information, as both
track review counts for the listing, just over different time windows. The
variables availability_30, availability_60, and
availability_90 are also highly correlated
because they overlap: availability over the next 30 days is counted
again in the 60- and 90-day windows.
Let’s see a visual representation of how accommodates is
associated with price.
ggplot(airbnb, aes(y = price, x = accommodates)) + geom_jitter(width = 0.5) +
geom_smooth(method = 'lm', se = F) +
labs(title = 'Number Accommodates vs. Price') +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
As expected, as the number of the people the Airbnb accommodates increases, the price also increases.
Let’s see how the neighborhood variable is distributed
to find out how many levels it has and which are the most common.
ggplot(airbnb, aes(x = neighborhood)) + geom_bar(fill = 'mediumorchid') +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, hjust=1)) +
labs(title = 'Distribution of Neighborhoods')
This is a visual representation of the distribution of our neighborhood groups. The highest frequencies are Eixample and Ciutat Vella. There are 10 different Barcelona neighborhoods that our observations are grouped into.
Next, we’ll see a visual representation of the distribution of neighborhoods’ prices.
ggplot(airbnb, aes(y= price, fill = neighborhood)) + geom_boxplot() +
labs(title = 'Distribution of Price by Neighborhood')
From these boxplots, we can see that neighborhood doesn’t drastically shift the median price of an Airbnb. The majority of observations are concentrated around 100 euros, so each neighborhood has many outlying observations. Nou Barris and Sant Andreu seem to have the smallest variation and the lowest prices, while Eixample appears to have some of the priciest Airbnbs.
Property type has 27 levels, most of which contain only a few observations. Since most of these levels are so uncommon, we’ll lump the least common ones into a group called ‘Other’.
airbnb %>% group_by(property_type) %>% count() %>% arrange(n)
## # A tibble: 27 × 2
## # Groups: property_type [27]
## property_type n
## <fct> <int>
## 1 Cabin 1
## 2 Cottage 1
## 3 Dorm 1
## 4 Farm stay 2
## 5 Earth house 3
## 6 Chalet 5
## 7 Barn 6
## 8 Dome house 6
## 9 Camper/RV 7
## 10 Tiny house 8
## # ℹ 17 more rows
All property types with fewer than 100 observations are lumped
into the ‘Other’ category. This leaves us with 10 total levels of the
property_type variable. Apartments are the most common by
far, with over 16,000 observations.
# Lump levels of property_type with less than 100 observations into 'Other'
airbnb <- airbnb %>%
mutate(property_type = fct_lump(property_type, n = 9, other_level = 'Other'))
airbnb %>% group_by(property_type) %>% count() %>% arrange(n)
## # A tibble: 10 × 2
## # Groups: property_type [10]
## property_type n
## <fct> <int>
## 1 Boutique hotel 102
## 2 Guest suite 168
## 3 Hostel 178
## 4 Bed and breakfast 205
## 5 Other 353
## 6 Condominium 359
## 7 House 375
## 8 Loft 462
## 9 Serviced apartment 671
## 10 Apartment 16700
ggplot(airbnb, aes(x = property_type)) + geom_bar(fill = 'mediumorchid') +
theme(axis.text.x = element_text(angle = 45, hjust=1))
Finally we’re ready to start setting up the models! Here we’ll split the data, set up k-fold cross validation, and prepare the recipe for the models. Let’s get started!
We need to split the data into training and testing sets. The
training set is for building and training the models; the testing set
will be saved until the end and only used to test the performance of our
best model. In the random initial split, I chose to make the training
set 75% of the total data set, leaving 25% for testing. I stratified on
the response, price. It is also important to set the seed
to ensure that this experiment can be replicated.
set.seed(2435)
split <- initial_split(airbnb, prop = 0.75, strata = price)
train <- training(split)
test <- testing(split)
Next, we set up 5 folds for cross-validation, stratified on price
once again. K-fold cross-validation splits the training
data into k separate folds and then trains each model k
times, each time holding out one fold for validation. This process
gives more stable performance estimates than a single validation split.
airbnb_folds <- vfold_cv(train, v = 5, strata = price)
Time to set up the recipe! First we specify price as the
response and make sure to use the training data. Then
I used step_rm() to exclude beds and
number_of_reviews_ltm from the recipe to avoid
multicollinearity: number_of_reviews_ltm is highly
correlated with number_of_reviews because they contain the
same information over different time windows, and beds is highly
correlated with bedrooms, as expected. I also dropped
availability_60 and availability_90 because we already have
availability for the next 30 days and for the full year. This leaves us
with 15 predictors and one response variable for our model!
The next step is to dummy code all categorical predictors. Since host_is_superhost is logical, we convert it to an integer. The last steps are to center and scale all predictors.
airbnb_recipe <- recipe(price ~ ., data=train) %>%
step_rm(c(beds, number_of_reviews_ltm, availability_60, availability_90)) %>%
step_dummy(all_nominal_predictors()) %>%
step_mutate(host_is_superhost = as.integer(host_is_superhost)) %>%
step_center(all_predictors()) %>%
step_scale(all_predictors())
prep(airbnb_recipe) %>% bake(new_data = train) %>% ncol()
## [1] 32
Now it’s time to build the models! I fit five models: Linear Regression, K-Nearest Neighbors, Elastic Net, Random Forest, and Boosted Trees. Since the response generally did not have high correlations with the features, I did not expect Linear Regression to be our best fit. Tree-based models like Random Forests and Boosted Trees generally perform well on non-linear data. They also handle outliers well, which will be useful for our skewed distribution.
Time to set up the models, tuning necessary parameters.
lm_model <- linear_reg() %>%
set_engine('lm') %>%
set_mode('regression')
knn_model <- nearest_neighbor(neighbors = tune()) %>%
set_engine("kknn") %>%
set_mode("regression")
en_model <- linear_reg(mixture = tune(),
penalty = tune()) %>%
set_mode("regression") %>%
set_engine("glmnet")
rf_model <- rand_forest(mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("regression")
boosted_model <- boost_tree(mtry = tune(),
trees = tune(),
learn_rate = tune()) %>%
set_engine("xgboost") %>%
set_mode("regression")
lm_wf <- workflow() %>%
add_model(lm_model) %>%
add_recipe(airbnb_recipe)
knn_wf <- workflow() %>%
add_model(knn_model) %>%
add_recipe(airbnb_recipe)
en_wf <- workflow() %>%
add_model(en_model) %>%
add_recipe(airbnb_recipe)
rf_wf <- workflow() %>%
add_model(rf_model) %>%
add_recipe(airbnb_recipe)
boosted_wf <- workflow() %>%
add_model(boosted_model) %>%
add_recipe(airbnb_recipe)
Here, I indicated the ranges for the tuning parameters.
knn_grid <- grid_regular(neighbors(range = c(1,15)),
levels = 10)
en_grid <- grid_regular(penalty(range = c(0, 1)),
mixture(range = c(0, 1)),
levels = 10)
rf_grid <- grid_regular(mtry(range = c(1,20)),
trees(range = c(200,500)),
min_n(range = c(10,20)),
levels = 5)
boosted_grid <- grid_regular(mtry(range = c(1, 20)),
trees(range = c(200, 500)),
learn_rate(range = c(-4, -1)),
levels = 5)
Next, we fit the linear regression model to the folds we set up earlier (it has no parameters to tune).
lm_fit <- lm_wf %>%
fit_resamples(resamples = airbnb_folds)
Tune the grids on the folds, using the workflows and grid parameters previously set up. I also saved the tuned models to my device to save time in the future and avoid rerunning the models each time.
tune_knn <- tune_grid(
object = knn_wf,
resamples = airbnb_folds,
grid = knn_grid
)
tune_en <- tune_grid(
object = en_wf,
resamples = airbnb_folds,
grid = en_grid
)
tune_rf <- tune_grid(
rf_wf,
resamples = airbnb_folds,
grid = rf_grid
)
tune_boosted <- tune_grid(
boosted_wf,
resamples = airbnb_folds,
grid = boosted_grid
)
beep()
#Save files
save(tune_knn, file = 'knn.rda')
save(tune_en, file = 'en.rda')
save(tune_rf, file = 'randforest.rda')
save(tune_boosted, file = 'boosted.rda')
Load the tuned models back so we can use them in further steps.
load('knn.rda')
load('en.rda')
load('randforest.rda')
load('boosted.rda')
Now that we’ve run all of our models and loaded them back, it’s time to assess how they performed! I’ll be using the Root Mean Squared Error (RMSE) to assess the results. RMSE is a measure of error calculated by taking the square root of the mean of the squared differences between predicted and observed values. Lower values of RMSE indicate a better fit of the model.
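As a concrete illustration, RMSE can be computed by hand from vectors of observed and predicted prices; this matches what yardstick computes with rmse_vec(). The numbers here are made up for the example, not taken from our models:

```r
# RMSE: square root of the mean squared difference between
# observed and predicted values
observed  <- c(50, 80, 120, 200)
predicted <- c(60, 75, 100, 250)

rmse_manual <- sqrt(mean((observed - predicted)^2))
rmse_manual  # 27.5

# Same result via yardstick:
# yardstick::rmse_vec(truth = observed, estimate = predicted)
```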
lm_metrics <- collect_metrics(lm_fit)
en_metrics <- collect_metrics(tune_en) %>% arrange(.metric,mean)
knn_metrics <- collect_metrics(tune_knn) %>% arrange(.metric,mean)
rf_metrics <- collect_metrics(tune_rf) %>% arrange(.metric,mean)
boosted_metrics <- collect_metrics(tune_boosted) %>% arrange(.metric,mean)
Let’s take a look at each of the models’ best metrics to find the lowest RMSE.
lm_rmse <- lm_metrics[1,]
en_rmse <- en_metrics %>% slice(1)
knn_rmse <- knn_metrics %>% slice(1)
rf_rmse <- rf_metrics %>% slice(1)
boosted_rmse <- boosted_metrics %>% slice(1)
lm_rmse
## # A tibble: 1 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 80.8 5 2.37 Preprocessor1_Model1
en_rmse
## # A tibble: 1 × 8
## penalty mixture .metric .estimator mean n std_err .config
## <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 0.111 rmse standard 80.5 5 2.37 Preprocessor1_Model011
knn_rmse
## # A tibble: 1 × 7
## neighbors .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 10 rmse standard 67.3 5 1.75 Preprocessor1_Model07
rf_rmse
## # A tibble: 1 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 17 350 10 rmse standard 61.2 5 2.51 Preprocessor1_Model0…
boosted_rmse
## # A tibble: 1 × 9
## mtry trees learn_rate .metric .estimator mean n std_err .config
## <int> <int> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 9 500 0.1 rmse standard 57.1 5 1.75 Preprocessor1_M…
rmse_means <- c(lm_rmse$mean,en_rmse$mean,knn_rmse$mean,rf_rmse$mean,boosted_rmse$mean)
model_names <- c('Linear Regression', 'Elastic Net', 'K-Nearest Neighbors', 'Random Forest', 'Boosted Trees')
rmse_comparison <- data.frame(model_names, rmse_means)
rmse_comparison
## model_names rmse_means
## 1 Linear Regression 80.82652
## 2 Elastic Net 80.53850
## 3 K-Nearest Neighbors 67.29418
## 4 Random Forest 61.22626
## 5 Boosted Trees 57.14131
It looks like our best model was Boosted Trees with an RMSE of 57.14131!
The autoplots of our models are a good way to visualize how each of the tuning parameters affected the RMSE metric and the results of the model.
autoplot(tune_en)
For Elastic Net regression, the RMSE was minimized with a penalty of 1 and a mixture of 0.111. Mixture had a big effect on the results: as mixture increased at higher penalty values, the RMSE spiked.
autoplot(tune_rf)
For our Random Forest model, there is not much variation across the number of trees, meaning the number of trees did not contribute greatly to the model’s performance. The number of predictors sampled at each split, mtry, had the biggest effect on model results: the more predictors the model could choose from, the better the fit tended to be.
autoplot(tune_boosted)
For the best model, Boosted Trees, there is more variation in number of trees and in learning rate. The best model performed with 500 trees, a high number of predictors, and a learning rate of 0.1.
Finally it’s time to select our best model and fit it to the entire training data to assess its performance!
The model I’ve decided to move forward with is the Boosted Trees model with the lowest RMSE value. Here we finalize the workflow and fit our best model to the entire training set.
best_boost <- select_best(tune_boosted, metric = 'rmse')
final_wf <- finalize_workflow(boosted_wf, best_boost)
final_fit <- fit(final_wf, data = train)
Using the final fit, we can now look at which variables contributed the most to the model.
final_fit %>% extract_fit_parsnip() %>%
vip() +
geom_bar(stat = "identity", fill = "mediumorchid") + labs(title = 'Variable Importance Plot')
According to the variable importance plot, accommodates,
host_listings_count, and bedrooms were the
three most important predictors in the model. Unsurprisingly, we noted
earlier in the correlation plot that the number of people an Airbnb
accommodates and its price are positively correlated. I was more
surprised that the number of listings the host has on Airbnb had such
predictive power. Perhaps hosts with many listings rent properties at
higher prices since hosting is their profession rather than a source of
passive income.
Now it’s time to test the final model on the testing set that we’ve been holding out.
final_fit_test <- augment(final_fit, new_data = test)
rsq(final_fit_test, truth = price, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rsq standard 0.638
rmse(final_fit_test, truth = price, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 59.6
Our model has an R squared of 0.6375, meaning that 63.75% of the variation in price can be explained by variation in the predictors. The RMSE on the testing set turned out to be 59.64, slightly higher than the Boosted Trees RMSE from cross-validation on the training set. It still outperformed the other models we trained.
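For reference, the R squared that yardstick’s rsq() reports is the squared Pearson correlation between the observed and predicted values, so it can be checked by hand. A small sketch with toy numbers (not our actual predictions):

```r
# R squared as the squared correlation between truth and estimate
observed  <- c(50, 80, 120, 200)
predicted <- c(60, 75, 100, 250)

rsq_manual <- cor(observed, predicted)^2

# yardstick::rsq_vec(truth = observed, estimate = predicted)
# returns the same value
```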
Let’s see how our predictions did against the actual values of price.
The x-axis represents price, the actual values from the
testing set. The y-axis is our model’s predictions. The blue line is the
ideal path the points would follow if the model fit perfectly: if every
prediction equaled the actual price, the best fit line would have a
slope of 1. The actual best fit line is in red. We can see that our
model under-predicted many of the higher-priced listings, since the red
line has a shallower slope than the blue one and many of the points fall
below the blue line.
final_fit_test %>%
ggplot(aes(x = price, y = .pred)) + geom_point(alpha = 0.5) +
geom_smooth(method = 'lm', color = 'red', linetype = 'dashed', se = F) +
geom_abline(color = 'blue', linetype = 'dashed') +
labs(title = 'Predicted vs. Actual Price')
## `geom_smooth()` using formula = 'y ~ x'
Finally, all of the models are fit and we can conclude that the Boosted Trees model worked best at predicting the price of an Airbnb. While the model did not do a perfect job, almost 64% of the variation was explained by the predictors. If the model were perfect, the \(R^2\) value would be 100%, indicating that all of the variation in the outcome was captured by the predictors. The RMSE of the final tested model was 59.64 euros, not much higher than the training RMSE.
The worst model that we fit was the linear regression model which was only able to account for 38.3% of the variation. This is most likely because our data is non-linear. As we saw earlier in the correlation plot, most of the predictors did not have high linear correlations with the response variable, price. This just means that the data does not fit a perfectly linear model and other models worked better to predict the outcome.
If I were to continue this project to improve the machine learning model, I would be interested in testing different predictors along with the ones I chose. I am interested in predicting trends in Airbnb prices as well. As I was working on this project, I began to wonder if there is a way to predict changes in Airbnb prices for a given area, say Barcelona. Perhaps with more data on the housing market and economy of Barcelona I could potentially predict changes and trends in pricing data.
Overall, in this project I was successful in creating a model that predicts the price of an Airbnb from its features with reasonable accuracy. I enjoyed this process and would like to continue developing my machine learning skills to create more accurate models in the future.
The data used for this project was downloaded from Kaggle: ‘Barcelona Airbnb listings - Inside Airbnb’. It was originally extracted from the Airbnb website by ‘Inside Airbnb’.